Below we describe the re-analysis reported in the paper “Arguing Practical Significance in Empirical Software Engineering”. Note that this R Markdown file can be turned into an R script and modified in case readers wish to change our models or their parameters, or adapt the analysis to their own dataset. If you do so, please remember to cite our paper:

@article{torkar2020arguing,
  title   = {Arguing Practical Significance in Empirical Software Engineering},
  author  = {Torkar, Richard and Feldt, Robert and de Oliveira Neto, Francisco Gomes and
             Furia, Carlo A. and Gren, Lucas and Lenberg, Per and Ernst, Neil D.},
  journal = {arXiv preprint arXiv:1809.09849},
  year    = {2020}
}

Below, we re-analyse the study published by Afzal et al. (2015), which compares whether exploratory testing reveals more faults than test case based testing. The study also investigates whether different levels of experience affect the number of faults revealed. The authors of the original study used frequentist statistics to compare the techniques, whereas we use Bayesian Data Analysis (BDA) and Cumulative Prospect Theory (CPT).

Our re-analysis is divided into several parts based on our manuscript. Here, we focus on presenting reproducible code, particularly for our BDA workflow and our usage of CPT. Discussion and findings can be found in the paper.

1 Setup

1.1 Packages and functions

The code below cleans your environment, to avoid carrying over unnecessary functions or variables, and then loads the libraries used in our script. We begin by installing and loading the required packages. For BDA, we mainly use brms (Bürkner 2019), whereas bayesplot (Gabry and Mahr 2019) and bayestestR (Makowski et al. 2020) provide support with various plots and functions to calculate credible intervals.

rm( list = ls() )  # Cleans the environment.

# You can install the packages in case they are not installed already.
# install.packages( c("tidyverse", "likert", "gridExtra", "kableExtra",
#                     "brms", "bayesplot", "bayestestR") )
# The CPT package is installed from source.
# install.packages("pt_1.1.tar.gz", repos=NULL, type="source")

library(tidyverse) # For transforming and visualizing data.
library(gridExtra) # Combining many plots into the same figure.
library(kableExtra)# Rendering tables to look nicer.

library(brms)      # BDA packages. Alternatively, one can use rethinking & rstanarm.
library(bayesplot) # Plotting BDA output by Gabry et al.
library(bayestestR)# Get the HDI values
bayesplot::color_scheme_set("viridis") #Uses the viridis palette on bayesplots.

library(pt)        # The Cumulative Prospect Theory (CPT) package.
library(likert)    # Manipulates and plots data on Likert scales.
library(readxl)    # Reads data from Excel files.

For reproducibility, we set a fixed (but arbitrary) seed; for efficiency, we use all available cores to speed up the MCMC sampling.

SEED <- 61215
set.seed(SEED)
options(mc.cores = parallel::detectCores())

Finally, we created several utility functions used throughout this document to simplify our code and improve readability/reuse. The code for all utility functions is presented in the Appendix (Section 6).

1.2 Overview of the dataset

We begin by loading and exploring our dataset. The data contains the number of faults (i.e., true positives, or tp) revealed by subjects with different levels of experience using two distinct testing techniques. We consider the following levels:

  • ME: More experienced tester.
  • LE: Less experienced tester.
  • ET: Exploratory testing.
  • TCT: Test case based testing.
# If you intend to run the script yourself, remember to check the file path is correct.
raw_data <- read.csv2("./datasets/experiment_data.csv", sep=";")

1.2.1 Descriptive statistics

1.2.2 Distributions (density plots)

Note that there does not seem to be a noticeable difference in the number of faults revealed when comparing less (LE) and more experienced (ME) subjects. Conversely, the difference is more apparent when comparing ET and TCT.

1.2.3 Raw data


2 BDA workflow

2.1 Creating the model

Similar to the original study, we want to answer:

  • Does one of the techniques (ET or TCT) reveal more faults than the other?
  • Is there a difference between less and more experienced subjects?

Therefore, we are interested in tp as an outcome, looking at three predictors: technique, experience, and subject. Moreover, there could also be confounding variables leading to interaction effects. We define the two models below to predict tp by building a generalised linear model (GLM) based on our three predictors, but with different likelihoods and parameters.


Defining M1

\(M_1\) assumes that tp follows a Poisson distribution with a rate of fault detection \(\lambda\), and that both experience and technique have an effect on that rate. Since we do not have prior information, we set the weakly informative prior \(N(0, 1.5)\) for all parameters. Next, we present both the mathematical notation and the corresponding code for \(M_1\).

\[ \begin{split} \mathcal{M}_1: \mathrm{tp} &\thicksim \mathrm{Poisson(\lambda)} \\ \mathrm{log(\lambda)} &= \alpha + \beta_t \times \mathrm{tech} + \beta_e \times \mathrm{exper}\\ \alpha, \beta_t,\beta_e &\thicksim \mathrm{Normal(0,1.5)} \\ \end{split} \]

m1 <-brm(formula = tp ~ technique + experience + subject,
         data = raw_data,
         prior= c(set_prior("normal(0,1.5)", class = "Intercept"),
                  set_prior("normal(0,1.5)", class = "b")),
         family = poisson(link=log), 
         chains = 4, refresh = 0, seed = SEED, sample_prior = "yes")
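As a quick intuition for what this prior implies (a standalone sketch, not part of the original analysis): since \(\lambda\) is modelled on the log scale, \(N(0, 1.5)\) on the intercept translates into a wide but bounded range of plausible fault-detection rates.

```r
# Prior-predictive intuition (sketch): sample alpha from the N(0, 1.5) prior
# and transform back through the log link to see the implied rates lambda.
set.seed(61215)
alpha  <- rnorm(1e4, mean = 0, sd = 1.5)
lambda <- exp(alpha)
round(quantile(lambda, c(0.025, 0.5, 0.975)), 2)
```

The median implied rate sits near one fault, with the bulk of the prior mass below roughly twenty faults, which is sane for sessions of this size.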

Defining M2

\(M_2\) also assumes a Poisson distribution, but accounts for the unusually large number of zero values in the data (tp = 0): 13 out of 70 observations (roughly 19%) in our dataset revealed no faults. Therefore, we choose a zero-inflated Poisson likelihood with the rate of fault detection (\(\lambda\)) and the probability \(p_i\) of finding no faults. We also consider varying intercepts for the subjects, i.e., we expect that no matter where, how, or how many participants we use, we would still have the same levels in technique and experience.

\[ \begin{split} \mathcal{M}_2: \mathrm{tp} \thicksim &\ \mathrm{ZIPoisson}(p_i, \lambda) \\ \log(\lambda) = &\ \alpha + \beta_t \times \mathrm{tech} + \beta_e \times \mathrm{exper}+ \alpha_{subject[i]}\\ \mathrm{logit}(p_i) =&\ \alpha_p + \beta_p \times \mathrm{tech} \\ \alpha_{subject[i]} \thicksim&\ \mathrm{Normal(\mu_s,\sigma_s)} \\ \alpha, \alpha_p, \beta_p, \beta_t, \beta_e, \mu_s \thicksim &\ \mathrm{Normal(0,1.5)} \\ \sigma_s \thicksim &\ \mathrm{HalfCauchy(0,1)} \\ \end{split} \]

Similarly to \(M_1\), \(M_2\) has technique and experience as predictors of tp. However, \(M_2\) adds varying intercepts for each subject via (1 | subject) and uses a different likelihood, namely zero_inflated_poisson. The varying intercepts also need a prior; here we use cauchy(0,1), since standard deviations only take positive values.

m2 <- brm(bf(formula = tp ~ 1 + technique + experience + (1 | subject), zi ~ technique),
          data = raw_data,
          prior = c(set_prior("normal(0,1.5)", class="b"),
                    set_prior("normal(0,1.5)", class="Intercept", dpar="zi"),
                    set_prior("cauchy(0,1)", class="sd")),
          family = zero_inflated_poisson(link=log),
          chains = 4, refresh = 0, seed = SEED, sample_prior = "yes")
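Before fitting, it may help to see what the zero-inflated Poisson likelihood assumes. The sketch below simulates from a ZIPoisson with illustrative values (lambda = 7, p = 0.2, chosen for illustration, not taken from the fitted model):

```r
# A zero-inflated Poisson mixes two processes: with probability p we observe
# a structural zero, otherwise an ordinary Poisson(lambda) count.
rzipois <- function(n, lambda, p) ifelse(runif(n) < p, 0L, rpois(n, lambda))

set.seed(61215)
y <- rzipois(1e4, lambda = 7, p = 0.2)
mean(y == 0)  # close to p + (1 - p) * exp(-lambda)
```

With a plain Poisson(7), zeros would be almost impossible (exp(-7) < 0.1%); the inflation term is what lets the model reproduce the ~19% of zero observations.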

2.2 Sensitivity analysis

After compiling and sampling the models, we can check whether our priors look sane. To do so, we sample from the models above, but using the option sample_prior = "only".

2.2.1 Prior analysis for M1

# Same as m1, but we change to only sample priors.
m1_priors <- brm(formula = tp ~ technique + experience + subject, data = raw_data,
         prior = c(set_prior("normal(0,1.5)", class = "Intercept"),
                   set_prior("normal(0,1.5)", class = "b")),
         family = poisson(link=log),
         chains = 4, refresh = 0, seed = SEED,
         sample_prior = "only")

\(M_1\) has a very high variance in the Intercept \([-53.44, 52.80]\), which is inconsistent with the variance of the other population-level effects (see the sampled priors summary). This is an indication that we should use varying intercepts for the subjects, which is already covered by \(M_2\).


M1 - Prior summary

##  Family: poisson 
##   Links: mu = log 
## Formula: tp ~ technique + experience + subject 
##    Data: raw_data (Number of observations: 70) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept       -0.17     27.25   -53.44    52.80       4307 1.00
## techniqueTCT    -0.01      1.54    -3.11     3.00       4557 1.00
## experienceME    -0.04      1.48    -2.95     2.89       4780 1.00
## subject          0.01      1.51    -2.96     2.94       4294 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

M1 - Prior distributions

2.2.2 Prior Analysis for M2

m2_priors <- brm(bf(formula = tp ~ 1 + technique + experience + (1 | subject), 
                    zi ~ technique),
          data = raw_data,
          prior = c(set_prior("normal(0,1.5)", class="b"),
                    set_prior("normal(0,1.5)", class="Intercept", dpar="zi"),
                    set_prior("normal(0,1.5)", class="b", dpar="zi"),
                    set_prior("cauchy(0,1)", class="sd")),
          family = zero_inflated_poisson(link=log),
          chains = 4, refresh = 0, seed = SEED, sample_prior = "only")

Similarly to the sensitivity analysis for \(M_1\), there is a high variance in the Intercept \([-31.23, 32.38]\); however, it is now captured as a group-level effect, as opposed to a population-level effect (see ~subject under Group-Level Effects in the sampled priors summary).

M2 - Priors summary

##  Family: zero_inflated_poisson 
##   Links: mu = log; zi = logit 
## Formula: tp ~ 1 + technique + experience + (1 | subject) 
##          zi ~ technique
##    Data: raw_data (Number of observations: 70) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~subject (Number of levels: 35) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     6.33    131.88     0.04    19.89       3657 1.00
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept           1.06     15.52   -31.23    32.38       2729 1.00
## zi_Intercept        0.05      1.67    -3.17     3.33       5282 1.00
## techniqueTCT        0.01      1.50    -2.91     2.98       5981 1.00
## experienceME       -0.02      1.54    -2.97     3.04       6469 1.00
## zi_techniqueTCT    -0.00      1.50    -3.00     2.94       6295 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

M2 - Priors distribution


2.2.3 Checking our posterior

In order to evaluate the predictions performed by our models, we must verify that:

  • \(\hat{R} < 1.01\), indicating that the chains have converged to a common posterior distribution.
  • The effective sample size (Eff.Sample) is not less than 10–20% of the total sample size (\(4,000\) sampled values). Otherwise, we would have auto-correlation within the chains that sample the posterior values.
  • We see “hairy caterpillars” when plotting the traces of our chains, indicating that the chains have mixed well.
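To demystify \(\hat{R}\), here is a miniature version of the split-\(\hat{R}\) idea (a sketch for intuition; brms/Stan compute a more refined variant automatically):

```r
# Split each chain in half, then compare between-chain and within-chain
# variance; values near 1 mean the (split) chains agree.
split_rhat <- function(chains) {  # chains: iterations x chains matrix
  half <- nrow(chains) %/% 2
  s <- cbind(chains[seq_len(half), ], chains[half + seq_len(half), ])
  n <- nrow(s)
  B <- n * var(colMeans(s))        # between-chain variance
  W <- mean(apply(s, 2, var))      # within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(61215)
well_mixed <- matrix(rnorm(4000), ncol = 4)  # four chains of white noise
split_rhat(well_mixed)  # close to 1 for well-mixed chains
```

Chains stuck in different regions inflate the between-chain variance B, pushing \(\hat{R}\) well above 1, which is exactly what the summaries below screen for.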

M1 - Posterior Summary

The summary indicates good results both for \(\hat{R}\) and effective sample size.

m1
##  Family: poisson 
##   Links: mu = log 
## Formula: tp ~ technique + experience + subject 
##    Data: raw_data (Number of observations: 70) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept        2.12      0.14     1.83     2.39       2432 1.00
## techniqueTCT    -1.74      0.15    -2.05    -1.44       2230 1.00
## experienceME     0.52      0.21     0.12     0.93       1853 1.00
## subject         -0.01      0.01    -0.03     0.01       1810 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

M1 - Trace plots

The trace plots look like hairy caterpillars, indicating that the chains have mixed well.

chosen_parameters <- c("b_Intercept", "b_techniqueTCT", "b_experienceME", "b_subject")
mcmc_trace(m1, pars = chosen_parameters) + PLOT_THEME + theme(legend.position = "None")

M2 - Posterior summary

The summary indicates good results both for \(\hat{R}\) and effective sample size. Note that the subjects are now a group-level effect, instead of a population-level effect.

m2
##  Family: zero_inflated_poisson 
##   Links: mu = log; zi = logit 
## Formula: tp ~ 1 + technique + experience + (1 | subject) 
##          zi ~ technique
##    Data: raw_data (Number of observations: 70) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~subject (Number of levels: 35) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     0.28      0.09     0.10     0.46       1242 1.00
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept           1.95      0.10     1.75     2.15       3812 1.00
## zi_Intercept       -4.66      1.39    -8.02    -2.53       3147 1.00
## techniqueTCT       -1.47      0.17    -1.83    -1.14       4125 1.00
## experienceME        0.32      0.16     0.01     0.62       3241 1.00
## zi_techniqueTCT     3.48      1.63     0.62     7.11       3118 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

M2 - Trace plots

The trace plots look like hairy caterpillars, indicating that the chains have mixed well.

chosen_parameters <- c("b_Intercept", "b_techniqueTCT", "b_experienceME",
                       "sd_subject__Intercept", "b_zi_Intercept", "b_zi_techniqueTCT")
mcmc_trace(m2, pars = chosen_parameters) + PLOT_THEME + theme(legend.position = "None")


We use the function ppc_dens_overlay to plot how our predictions fit the actual data. Note that \(M_2\) has a better fit, especially close to zero tp, indicating that a zero-inflated Poisson is a better choice of likelihood.

custom_sample <- 50 # We overlay 50 posterior predictive draws.
fit_m1 <- ppc_dens_overlay(yrep = posterior_predict(m1, nsamples = custom_sample),
                           y    = raw_data$tp)

fit_m2 <- ppc_dens_overlay(yrep = posterior_predict(m2, nsamples = custom_sample),
                           y    = raw_data$tp)
Figure 2.1: Fit of our predictions using M1 and M2

In conclusion, \(M_2\) fits our data better (Figure 2.1) and captures group-level variance. It would be risky to use \(M_1\), given the high variance of \(\alpha\) modelled as a population-level effect (Section 2.2). Therefore, we continue our analysis only with \(M_2\).


2.3 Conducting inference

After checking that our model is sane, we can start using the posterior distribution to make predictions of the number of faults revealed by each technique and different levels of experience. The steps are: 1) get the posterior distribution, 2) define the levels to predict, 3) run the model with the posterior values to get our outcome tp.

# 1- Getting a posterior distribution
#    Gets the posterior distribution from the model. Those are values for the
#    parameters. This is what brms uses to do predictions of number of faults (tp)
posterior_dist <- posterior_samples(m2)

# 2- Defining which predictors (levels) to use.
#    Defines the levels that we are interested in sampling. In our case, all
#    combinations of experiences and techniques (ET x LE, ET x ME, TCT x LE, TCT x ME).
predictions_df <- data.frame(technique =c("ET", "ET", "TCT", "TCT"),
                             experience=c("LE", "ME", "LE" , "ME"))

# 3- Run the model
#    Push our posterior through the model using the specified 
#    levels to obtain the outcome variable.
levels_pred_df <- fitted(m2, newdata = predictions_df, re_formula = NA, summary = FALSE)

# Round tp values, convert to a data frame, and rename the columns.
levels_pred_df <- levels_pred_df %>% round %>% as.data.frame
colnames(levels_pred_df)<- c("LE_ET","ME_ET","LE_TCT","ME_TCT")

# Do some data wrangling to convert the combined levels to "per-level" predictions.
# We will use this to plot marginal effects of each level.
marginal_pred_df <- get_individual_levels(levels_pred_df)

Raw posterior distribution

posterior_dist

Predictions (combined levels)

levels_pred_df

Predictions (individual levels)

marginal_pred_df


We can see the marginal effect of each factor by looking at how much of our posterior overlaps zero. In other words, we want to see which predictors have a positive or negative effect on the outcome variable, as opposed to no effect. This can be easily done using the function mcmc_areas from the bayesplot package.

Since the posterior is a distribution, we look at the HDI (highest density interval), the narrowest interval that covers a given proportion of the distribution. In our case, we specify an HDI (also referred to as a credible interval, CI) of 94%, shown as the highlighted area.
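What the HDI computes can be written out in miniature (a sketch for intuition; in practice we rely on bayestestR::hdi):

```r
# The HDI is the narrowest interval containing `ci` of the posterior samples.
hdi_sketch <- function(samples, ci = 0.94) {
  s <- sort(samples)
  k <- ceiling(ci * length(s))                      # points the interval must cover
  widths <- s[k:length(s)] - s[seq_len(length(s) - k + 1)]
  i <- which.min(widths)                            # narrowest window of k points
  c(lower = s[i], upper = s[i + k - 1])
}

set.seed(61215)
round(hdi_sketch(rnorm(1e5)), 2)  # roughly symmetric around 0 for a normal
```

For skewed posteriors the HDI differs from an equal-tailed interval, which is why it is the better summary of "most probable" parameter values.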

Below, we see that the Technique variable has a negative effect, meaning that one of the factors (in this case TCT, since the coefficient encodes the change from ET to TCT) actually reduces the number of faults revealed. The interesting case here is the Experience variable, because its posterior is very close to zero. This indicates that the effect of experience on the number of faults revealed can be very small, almost negligible.

In the original study, the authors concluded that experience has no significant effect on detecting faults. Our BDA indeed shows that the effect is very small, but the conclusions are more nuanced, since the majority of the posterior mass is still positive (particularly for more experienced testers).

plot_posterior_region(m2, c("b_techniqueTCT", "b_experienceME"), 
                      c("Technique", "Experience"))

We can also visually compare the predictors by plotting our predictions. Below we show descriptive statistics, as well as the number of faults for a 94% credible interval for each level (combined and marginal). Note that we will use these values later (CI-low, Median, CI-high) in our CPT framework to support decision making.

marginal_summary <- marginal_pred_df %>% gather(key="Predictors", value="Faults") %>%
  group_by(Predictors) %>% 
  summarise(Median = median(Faults), Mean = mean(Faults), SD = sd(Faults))

levels_summary   <- levels_pred_df %>% gather(key="Predictors", value="Faults") %>%
  group_by(Predictors) %>% 
  summarise(Median = median(Faults), Mean = mean(Faults), SD = sd(Faults))

hdi_levels     <- hdi(levels_pred_df, ci = 0.94)
hdi_marginal   <- hdi(marginal_pred_df, ci = 0.94)
hdi_predictions <- bind_rows(hdi_marginal, hdi_levels) %>%
  rename("Predictors" = Parameter)

full_summary <- bind_rows(marginal_summary, levels_summary)
full_summary <- left_join(hdi_predictions, full_summary, by="Predictors")

Effects (summary table)

Table 2.1: Summary of predicted faults for each variable.
Predictors    CI   CI_low   CI_high   Median       Mean         SD
ME            94        1        11      4.0   5.663625  4.0749604
LE            94        1         8      3.5   4.036750  2.9913238
ET            94        6        11      8.0   8.298000  1.6857643
TCT           94        1         2      1.0   1.402375  0.5071997
LE_ET         94        6         8      7.0   6.970500  0.7666660
ME_ET         94        8        12     10.0   9.625500  1.2534522
LE_TCT        94        1         2      1.0   1.103000  0.3039969
ME_TCT        94        1         2      2.0   1.701750  0.4928059

Effects (boxplots)

Our analysis of the predictions reveals that:

  • The choice of technique has a large effect on the number of faults: ET reveals more faults than TCT (e.g., medians of 8 vs. 1 faults).
  • Practitioners’ experience has very little effect overall, but more experienced practitioners tend to reveal slightly more faults. However, less experienced participants using ET performed better than more experienced participants using TCT (LE_ET >>> ME_TCT).
  • These differences still amount to only a few faults. For instance, the highest median is 10 faults (ME_ET), and many levels differ by only one or two faults. So, the cost of fixing those faults will likely be a more determining factor for adoption.

In conclusion, ET should be preferred over TCT, and experience can have an effect depending on the cost of a fault. In comparison to the original experiment, BDA enables a more nuanced analysis because it is based on distributions as opposed to point estimates and \(p\)-values. For instance, the results for experience are not “clearly” significant or non-significant; rather, their interpretation depends on context information, such as the cost of fixing faults or the salaries of more experienced developers.


3 Applying CPT

We use the pt package (Au 2014) for calculations of prospects and utility. We start by defining the following choice problems:

  • CP1: Should a manager adopt ET or TCT?
  • CP2: Should a manager hire more (ME) or less experienced (LE) testers?
  • CP3: After choosing the best technique, should she hire more (ME) or less experienced (LE) testers?

Since the original paper does not include context information about the cost of fixing a fault or the salaries of engineers, we will assume that a technique that detects more faults compensates for the costs of fixing bugs after release, i.e., faults missed during testing must be fixed when they are reported later:

\[ \begin{split} \nu(x_i) = (\texttt{tp} \times \texttt{fault-cost}) - (\mathrm{\texttt{hours-per-session}} \times \mathrm{\texttt{hourly-salary}}) \end{split} \]

Here, we consider a regular and a critical scenario for fault-cost, in addition to fixed values for the hourly salaries.

Description                                        Variables           Values
Cost of a regular fault                            fault-cost          $150
Cost of a critical fault                           fault-cost          $1000
Cost of debugging with a less experienced tester   hourly-salary       $100
Cost of debugging with a more experienced tester   hourly-salary       $200
Duration of a debugging session                    hours-per-session   4
#--------------------------------------------------------------------
# Setting values and functions for our context.
#--------------------------------------------------------------------

# get_vi_cost calculates the subjective value of a prospect, v(xi):
# the value of the detected faults minus the cost of the testing session.
FAULT_COST <- 150

SAL_LE <- 100 # $100/h hourly salary for LE
SAL_ME <- 200 # $200/h hourly salary for ME
SESSION_TIME  <- 4 # Num hours of work in a session: 4h
ME_STR <- "ME"
LE_STR <- "LE"
ET_STR <- "ET"
TCT_STR<- "TCT"

get_vi_cost <- function(tp, session_costs){
  (tp * FAULT_COST) - session_costs
}
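To make \(\nu(x_i)\) concrete, here is a worked example with hypothetical numbers (8 faults in a 4 h session at the LE rate; these values are illustrative, not taken from the dataset):

```r
# Worked example of v(x_i) with hypothetical numbers (not from the dataset):
tp <- 8                  # faults found in the session
session_cost <- 4 * 100  # hours-per-session * LE hourly salary
tp * 150 - session_cost  # (tp * fault-cost) - session cost = 1200 - 400 = 800
```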

Having defined the variables and the equation above, we model our choice problems in terms of three possible outcomes for each level. The outcomes are:

  1. 3% probability of doing better than expected and revealing more faults.
  2. 94% probability of revealing the expected number of faults (median).
  3. 3% probability of doing worse than expected and revealing few faults.

Note that those values are connected to our choice of HDI (94% credible interval) from Table 2.1. If the outcomes are based on other probability values, then the HDI should be calculated again with the new values.

#--------------------------------------------------------------------
# Configuring framework for pt package labels for both choice problems
#--------------------------------------------------------------------

# For each choice, we consider three cases, in connection to our HDI (3%, 94%, 3%).
probability_strings <- c("0.03", "0.94", "0.03", 
                         "0.03", "0.94", "0.03")

choice_ids <- c(1,1,1, 1,1,1) # All six outcomes belong to the same choice problem.
gamble_ids <- c(1,1,1, 2,2,2) # The choice is divided into two gambles (e.g., ME vs. LE),
outcome_ids<- c(1,2,3, 1,2,3) # each with three distinct outcomes.

#--------------------------------------------------------------------
# This is simply a utility function that retrieves tp values from our HDI
#--------------------------------------------------------------------
get_fault_values <- function(summary_df, level){
  level_row <- summary_df %>% filter(Predictors == level)
  value_high_pred <- round(level_row[1, "CI_high"] + 1)
  value_med_pred  <- round(level_row[1, "Median"])
  value_low_pred  <- round(level_row[1, "CI_low"] - 1)

  faults_from_level <- c(value_high_pred, value_med_pred, value_low_pred)
  return(faults_from_level)
}

3.1 CP1 - ET vs. TCT

In order to calculate the cost of using each technique independently of the participants’ experience, we use an estimated cost based on our dataset. We had a total of 35 participants (23 LE and 12 ME) using both techniques; hence, we use the weighted average cost of a bug-fixing session.

#--------------------------------------------------------------------
# objective consequences for the *different techniques* - CP1
#--------------------------------------------------------------------
LE_SUB <- 23
ME_SUB <- 12
avg_cost_session <- ((LE_SUB * SAL_LE) + (ME_SUB * SAL_ME) ) / (LE_SUB + ME_SUB)
avg_cost_session <- round(avg_cost_session,2)

et_outcomes <- get_vi_cost(get_fault_values(full_summary, ET_STR) , avg_cost_session)
tct_outcomes<- get_vi_cost(get_fault_values(full_summary, TCT_STR), avg_cost_session)

techniques_consequences <- c(et_outcomes, tct_outcomes)

# Let's put everything together in a Choices object.
choices_techniques <- Choices(choice_ids = choice_ids, gamble_ids = gamble_ids,
                              outcome_ids= outcome_ids,
                              objective_consequences = techniques_consequences,
                              probability_strings = probability_strings)

# Plot the decision problem for the CP1.
plot_choice_problem(choices_techniques, c(ET_STR, TCT_STR, "T"))
Figure 3.1: Each level has three options showing, respectively, the best (top), median, and worst (bottom) case. ET has a better profit than TCT in all cases.

From the figure above, note that ET yields considerably more profit in all outcomes, whereas TCT can actually lead to losses in the pessimistic case (TCT:3 = -$134.29). So, given our assumed context, ET should be chosen.


3.2 CP2 - ME vs. LE

Here, we consider the impact of experience independently of which technique is being used.

#--------------------------------------------------------------------
# Objective consequences for the *levels of experience* - CP2
#--------------------------------------------------------------------
me_outcomes <- get_vi_cost(get_fault_values(full_summary, ME_STR), 
                           SESSION_TIME * SAL_ME)
le_outcomes <- get_vi_cost(get_fault_values(full_summary, LE_STR), 
                           SESSION_TIME * SAL_LE)

experience_consequences <- c(me_outcomes, le_outcomes)

# Let's put everything together in a Choices object.
choices_experience <- Choices(choice_ids = choice_ids, gamble_ids = gamble_ids,
                              outcome_ids= outcome_ids,
                              objective_consequences = experience_consequences,
                              probability_strings = probability_strings)

#Plot the decision problem for the CP2.
plot_choice_problem(choices_experience, c(ME_STR, LE_STR, "E"))
Figure 3.2: Profits and losses for the best (top), median, and worst (bottom) cases in each choice. The outcomes related to ME lead to losses, whereas LE results in profit.

From the figure above, note that it is more profitable, on average, to hire less experienced testers (LE:94 = $200), since more experienced testers lead to an expected loss (ME:94 = -$200). Even though ME has a 3% optimistic chance of yielding high profits (ME:3 = $1000), the difference from the corresponding case with less experienced testers (LE:3 = $950) is very small.


3.3 CP3 - ET + (ME vs. LE)

Results from CP1 show that ET is considerably more profitable than TCT, whereas CP2 shows that experience has no noticeable impact on the number of revealed faults. The follow-up question is whether experience is a relevant factor when combined with ET.

#--------------------------------------------------------------------
# Objective consequences for experience GIVEN that we choose ET - CP3
#--------------------------------------------------------------------

et_me_outcomes<-get_vi_cost(get_fault_values(full_summary,"ME_ET"),
                            SESSION_TIME * SAL_ME)
et_le_outcomes<-get_vi_cost(get_fault_values(full_summary,"LE_ET"),
                            SESSION_TIME * SAL_LE)

tech_and_exp_consequences <- c(et_me_outcomes, et_le_outcomes)

choices_tech_and_exp_regular<-Choices(choice_ids=choice_ids, gamble_ids=gamble_ids,
                                      outcome_ids=outcome_ids,
                                      objective_consequences=tech_and_exp_consequences,
                                      probability_strings=probability_strings)

plot_choice_problem(choices_tech_and_exp_regular, c(ME_STR, LE_STR, "E & T"))
Figure 3.3: Outcomes for choosing ET with the different experiences and regular cost of fixing faults.

Based on Figure 3.3, and similarly to our choice problem CP2, the levels of experience do not seem to differ much in their prospects. However, other factors can influence this decision, such as recruitment costs.

For instance, the cost of a fault can significantly change this scenario. In Figure 3.4, we calculate prospects in a hypothetical critical scenario where the cost of a fault is much higher (e.g., $1000). As a consequence, hiring more experienced testers becomes more profitable and compensates for their higher salaries.

# New hypothetical cost of a fault in, e.g., testing a critical system.
CRIT_FAULT_COST <- 1000

get_vi_cost_critical <- function(tp, session_costs){
  (tp * CRIT_FAULT_COST) - session_costs
}

#--------------------------------------------------------------------
# Same steps as before, but using the critical fault cost value.
#--------------------------------------------------------------------
et_me_outcomes<-get_vi_cost_critical(get_fault_values(full_summary,"ME_ET"),
                                     SESSION_TIME * SAL_ME)
et_le_outcomes<-get_vi_cost_critical(get_fault_values(full_summary,"LE_ET"),
                                     SESSION_TIME * SAL_LE)

tech_and_exp_consequences <- c(et_me_outcomes, et_le_outcomes)

choices_critical_faults <-Choices(choice_ids=choice_ids, gamble_ids=gamble_ids,
                                  outcome_ids=outcome_ids,
                                  objective_consequences=tech_and_exp_consequences,
                                  probability_strings=probability_strings)

plot_choice_problem(choices_critical_faults , c(ME_STR, LE_STR, "E & T"))

Figure 3.4: Differences in the experience when considering the costs of critical faults.


3.4 Calculating Utility

Next, we use the choice models to calculate utility values. The function comparePT returns an overview of the values and risks surrounding the choices, based on parameterized models that represent a decision maker's preferences. We refer to Tversky and Kahneman (1992) for the details of the models and how to calculate the prospect utility. Our goal is to compare and discuss the utility values.
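To make this concrete, below is a minimal sketch of the two ingredients that comparePT combines: the Tversky and Kahneman (1992) power value function and a linear-in-log-odds probability-weighting function. The names tk_value and llo_weight are ours (not part of the pt package), and the weighting formula shown is our reading of that parameterization; comparePT performs the full cumulative computation internally.

```r
# Tversky & Kahneman (1992) power value function:
# v(x) = x^alpha for gains, -lambda * (-x)^beta for losses.
tk_value <- function(x, alpha = 0.88, beta = 0.88, lambda = 2.25) {
  ifelse(x >= 0, x^alpha, -lambda * (-x)^beta)
}

# Linear-in-log-odds probability weighting:
# w(p) = (beta * p^alpha) / (beta * p^alpha + (1 - p)^alpha)
llo_weight <- function(p, alpha = 0.61, beta = 0.724) {
  (beta * p^alpha) / (beta * p^alpha + (1 - p)^alpha)
}

tk_value(100)    # gains are valued concavely (~57.5)
tk_value(-100)   # losses loom larger due to lambda = 2.25 (~-129.5)
llo_weight(0.5)  # moderate probabilities are underweighted (~0.42)
```

Note how losses are amplified by the loss-aversion coefficient lambda, and how the weighting function distorts objective probabilities; both effects explain why the prospect utilities below differ from the plain expected values.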

# Alpha, beta, lambda and prob. weights according to Tversky & Kahneman (1992)
tk_1992_utility <- Utility(fun="power", par=c(alpha=0.88, beta=0.88, lambda=2.25))
log_odds_prob_weight <- ProbWeight(fun="linear_in_log_odds", 
                                   par=c(alpha=0.61, beta=0.724))

cpt_techniques <- comparePT(choices_techniques,
                            prob_weight_for_positive_outcomes=log_odds_prob_weight,
                            prob_weight_for_negative_outcomes=log_odds_prob_weight,
                            utility=tk_1992_utility, digits=4)

cpt_exp_critical <- comparePT(choices_critical_faults,
                              prob_weight_for_positive_outcomes=log_odds_prob_weight,
                              prob_weight_for_negative_outcomes=log_odds_prob_weight,
                              utility=tk_1992_utility, digits=4)

cpt_exp_regular <- comparePT(choices_tech_and_exp_regular,
                             prob_weight_for_positive_outcomes=log_odds_prob_weight,
                             prob_weight_for_negative_outcomes=log_odds_prob_weight,
                             utility=tk_1992_utility, digits=4)

# Merge all tables into one and remove unused columns. 
# Even though we do not use in our analysis, the pt package also provides information
# about certainty effects and risk premium about the choice problems.
merged_table <- bind_rows(cpt_techniques, cpt_exp_regular, cpt_exp_critical) %>%
  select(-c(cid,ce,rp)) %>% rename("Levels" = "gid", "Expected Value" = "ev", 
                                   "Prospect Utility" = "pt")
Table 3.1: Comparing utility values between choice problems

Levels                                                           Expected Value   Prospect Utility
CP1 - Between ET (1) or TCT (2):
  1                                                              1070             454.3
  2                                                              20.21            8.017
CP2 - Between more (1) or less (2) exp. for regular fault cost:
  1                                                              700              305.9
  2                                                              650              290.4
CP3 - Between more (1) or less (2) exp. for critical fault cost:
  1                                                              9200             3018
  2                                                              6600             2256

Note that the expected values in Table 3.1 are simply the probability-weighted averages of the outcomes in the decision trees above (Figures 3.1, 3.2, 3.3 and 3.4). However, they are not adjusted to how humans behave when choosing; hence we compare the prospect utility instead. Note that our choices discussed above agree with the utility values.
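As a toy illustration (the outcomes and probabilities below are invented and unrelated to our data), the expected value of a gamble is simply the probability-weighted sum of its outcomes:

```r
# Hypothetical two-outcome gamble, for illustration only.
outcomes <- c(1200, 800)
probs    <- c(0.6, 0.4)
ev <- sum(probs * outcomes)
ev  # 1040
```

The prospect utility instead applies a value function and decision weights to these same outcomes and probabilities, which is what makes it sensitive to loss aversion and probability distortion.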

In conclusion, a practitioner would choose as follows:

  • CP1: A manager should choose ET over TCT, since ET is more profitable when finding and fixing faults.

  • CP2: Choosing between more and less experienced testers depends on the costs of fixing a fault found after release, as opposed to their individual performance in finding faults. For faults with lower costs, more experienced testers do not add value to the fault detection process.

  • CP3: Given that we are using ET to find faults, the difference in profit based on the experience is small. Similarly to CP2, the decision to hire more experienced testers is connected to the cost of fixing a fault, rather than the choice of testing technique.


4 Evaluation With Practitioners

We compare our approach using BDA + CPT to the traditional practice of reporting hypothesis tests in terms of \(p\)-values. We want to see what happens when managers need to make a decision based on both types of descriptions. Therefore, we presented our decision trees and the results of the hypothesis tests reported in the original study by Afzal et al. (2015) to practitioners and asked them:

  • Q1.1: Based on the presented information, would you introduce exploratory testing? (Yes/No);
  • Q1.2: How confident are you in your decision?;
  • Q2.1: Would you utilize more senior testers? (Yes/No);
  • Q2.2: How confident are you in your decision?

Our sample is composed of \(N = 15 + 7\) mid- to upper-level managers at two large companies in Sweden. We removed \(3 + 2\) cases of missing data, for a total of \(18\) valid participants' answers. The survey instrument is available here, and the data is presented in Table 4.1.

raw_validation_df <- read_xlsx("datasets/data.xlsx")

# We'll make use of the columns "Years in company" and "A", "B", "D", "E", "G", 
# "H", "J", and "K". We rename columns to correspond to our levels (ET, TCT, LE and ME).

# We rename the columns to represent the Traditional vs. Bayesian description
# (suffixes TRAD and BA, respectively). Columns ending with "L" are the 
# participants' confidence answers on a Likert scale.
validation_df <- raw_validation_df %>% select("Experience" = "Years in company", 
                                              "ET_TRAD" = "A", "ET_TRAD_L" = "B",
                                              "ME_TRAD" = "D", "ME_TRAD_L" = "E", 
                                              "ET_BA"   = "G", "ET_BA_L"   = "H", 
                                              "ME_BA"   = "J", "ME_BA_L"   = "K")
validation_df <- validation_df %>% drop_na()
Table 4.1: Participants’ answers.
Experience ET_TRAD ET_TRAD_L ME_TRAD ME_TRAD_L ET_BA ET_BA_L ME_BA ME_BA_L
13 1 3 1 2 1 5 1 5
7 1 4 1 4 1 5 1 5
24 1 3 1 3 1 3 1 3
19 1 3 1 2 1 4 1 4
1 1 3 1 3 1 4 1 3
32 1 3 1 3 1 3 1 3
8 1 4 0 4 1 5 1 5
25 1 3 1 4 1 4 1 5
20 1 2 1 4 1 3 1 3
10 1 4 0 3 1 4 1 4
3 1 4 0 4 1 5 1 5
3 1 3 1 4 1 4 1 4
6 1 3 1 4 1 2 1 4
2 1 2 1 2 1 4 1 4
12 1 5 1 4 1 3 1 5
33 0 4 1 4 0 3 1 4
24 1 3 1 4 1 3 1 3
2 0 5 0 5 0 5 1 5

Figure 4.1 shows that the decision itself (Q1.1) is not influenced by the description provided, but participants seem more confident when using our decision tree from the CPT analysis.

likert_scores= c("1", "2", "3", "4", "5")
likert_texts = c("1" = "Not at all confident", "2" = "Slightly confident",
                 "3" = "Somewhat confident"  , "4" = "Fairly confident", 
                 "5" = "Completely confident")

yesno_scores= c("0", "1")
yesno_texts = c("0" = "No", "1" = "Yes")

# Diverging plots for the yes/no answers on Q1.1
q1_yesno_answers <- validation_df %>% 
  select("Traditional results" = ET_TRAD, "CPT results"= ET_BA) %>% 
  mutate_all(~factor(., levels = yesno_scores)) %>% as.data.frame

results_q1_yesno<- likert(q1_yesno_answers)
q1_yesno_plot   <- create_diverging_plot(results_q1_yesno, yesno_texts, 
                                         "Would you introduce ET?")

# Diverging plots for the likert scales on Q1.2
q1_likert_answers <- validation_df %>% 
  select("Traditional results" = ET_TRAD_L, "CPT results"= ET_BA_L) %>%
  mutate_all(~factor(., levels = likert_scores)) %>% as.data.frame

results_q1_likert<- likert(q1_likert_answers)
q1_likert_plot  <- create_diverging_plot(results_q1_likert, likert_texts, 
                                         "How confident are you in your decision?")

# Combine both plots into a single figure.
grid.arrange(q1_yesno_plot, q1_likert_plot, ncol = 1, nrow = 2)

Figure 4.1: Diverging plot on decisions related to introducing ET.

Regarding the decision to hire more senior testers (Q2.1), more participants chose to hire senior testers when presented with the results of our CPT analysis, as opposed to \(p\)-values. Similarly to Q1.2, participants were also more confident in their decision when reading results from our decision tree.
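To put a rough number on this confidence shift, one can compare the paired Q1.2 scores directly; the vectors below transcribe the ET_TRAD_L and ET_BA_L columns of Table 4.1. This quick summary is ours and is not part of the original analysis.

```r
# Q1.2 confidence scores transcribed from Table 4.1.
et_trad_l <- c(3, 4, 3, 3, 3, 3, 4, 3, 2, 4, 4, 3, 3, 2, 5, 4, 3, 5)
et_ba_l   <- c(5, 5, 3, 4, 4, 3, 5, 4, 3, 4, 5, 4, 2, 4, 3, 3, 3, 5)

# Average per-participant change in confidence (positive values mean
# more confidence with the CPT description).
mean(et_ba_l - et_trad_l)  # ~0.44
```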

# We repeat the steps above, but now for hiring senior testers (Q2)

# First, the yes / no questions (Q2.1)
q2_yesno_answers <- validation_df %>% 
  select("Traditional results" = ME_TRAD, "CPT results"= ME_BA) %>%
  mutate_all(~factor(., levels = yesno_scores)) %>% as.data.frame

results_q2_yesno<- likert(q2_yesno_answers)
q2_yesno_plot <- create_diverging_plot(results_q2_yesno, yesno_texts,
                                       "Would you hire more senior testers?")

# Then, the likert scale answers (Q2.2)
q2_likert_answers <- validation_df %>% 
  select("Traditional results" = ME_TRAD_L, "CPT results"= ME_BA_L) %>%
  mutate_all(~factor(., levels = likert_scores)) %>% as.data.frame

results_q2_likert <- likert(q2_likert_answers)
q2_likert_plot <- create_diverging_plot(results_q2_likert, likert_texts,
                                        "How confident are you in your decision?")

# Lastly, combine both plots into the same figure.
grid.arrange(q2_yesno_plot, q2_likert_plot, ncol = 1, nrow = 2)

Figure 4.2: Diverging plot on decisions related to hiring experienced testers

In conclusion, practitioners make somewhat different decisions depending on the format used to report results (in particular in unclear cases). More importantly, the format affects the confidence behind those decisions. For instance, roughly the same proportion of practitioners would adopt ET; however, they reported being more confident in their decisions when presented with the CPT results (Figure 4.1).


5 References

Afzal, Wasif, Ahmad Nauman Ghazi, Juha Itkonen, Richard Torkar, Anneliese Andrews, and Khurram Bhatti. 2015. “An Experiment on the Effectiveness and Efficiency of Exploratory Testing.” Empirical Software Engineering 20 (3): 844–78. https://doi.org/10.1007/s10664-014-9301-4.

Au, Gary. 2014. Pt: Computational Models for Prospect Theory and Other Theories of Risky Decision Making. https://CRAN.R-project.org/package=pt.

Bürkner, Paul-Christian. 2019. Brms: Bayesian Regression Models Using ’Stan’. https://CRAN.R-project.org/package=brms.

Gabry, Jonah, and Tristan Mahr. 2019. Bayesplot: Plotting for Bayesian Models. https://CRAN.R-project.org/package=bayesplot.

Makowski, Dominique, Daniel Lüdecke, Mattan S. Ben-Shachar, and Michael D. Wilson. 2020. BayestestR: Understand and Describe Bayesian Models and Posterior Distributions. https://CRAN.R-project.org/package=bayestestR.

Tversky, Amos, and Daniel Kahneman. 1992. “Advances in Prospect Theory: Cumulative Representation of Uncertainty.” Journal of Risk and Uncertainty 5 (4): 297–323. https://doi.org/10.1007/BF00122574.


6 Appendix

Here we present the code for some utility functions used to simplify our R code and save space in this document. Most of the code here simplifies data wrangling using dplyr and plotting using ggplot.

# This is a constant with our desired theme used in all plots.
PLOT_THEME <- theme_minimal(base_size = 10) + 
        theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
        panel.grid.minor = element_blank(), panel.grid = element_line(size=0.3),
        axis.line = element_line(size = 1), axis.ticks = element_line(size = 1),
        strip.background = element_rect(colour = "white", fill = "aliceblue"),
        legend.position = "bottom")

# Extracts and plots the prior samples drawn when running the models.
prior_analysis <- function(model, chosen_parameters = NULL){
  # Sampling is done on the logit scale, so we transform back to the outcome scale.
  long_df <- posterior_samples(model) %>% inv_logit_scaled()
  
  # Plot only a few parameters in case the model has too many.
  if(!is.null(chosen_parameters)){
    long_df <- long_df %>%  select(chosen_parameters)
  }
  
  long_df <- long_df %>% gather(key = "Parameter", value = "Value")
  mean_values <- long_df %>% group_by(Parameter) %>% summarise("Mean" = mean(Value))
  
  ggplot(long_df, aes(x=Value, fill = Parameter)) + geom_density(color = "black") +
    geom_vline(mean_values, mapping = aes(xintercept = Mean), 
               linetype = "dashed", size = 1, color = "gray37") +
    scale_fill_viridis_d() + labs(x="Sampled values", y="Density", fill="Parameters: ") +
    facet_wrap(~Parameter, nrow=2, scales = "free") + 
    PLOT_THEME + theme(legend.position = "None")
}

# Function to plot the HDI of the posterior (wrapper around the mcmc_areas function)
plot_posterior_region <- function(posterior, model_var, y_labels,
                                  ci_width = 0.94){
  region_plot <- mcmc_areas(posterior, pars = c(model_var), prob = ci_width) +
    geom_vline(xintercept = 0, linetype = "dashed") + 
    scale_y_discrete( labels = y_labels) + labs(x = "Sampled values") + PLOT_THEME
  return(region_plot)
}

# Utility function to convert predictions for different levels 
# into individual levels, to compare marginal effects.
get_individual_levels <- function(all_predictions){
  #1. Turn it into a long table and then separate the combined levels into two columns
  long_df_pred <- all_predictions %>% 
    gather(key = "CombinedLevels", value = "tp") %>%
    separate("CombinedLevels", into = c("Experience","Technique"), sep = "_") %>% 
    mutate_at(.vars = c("Technique","Experience"),.funs = factor)
  
  #2. Filter the fault values per individual level
  faults_me <- long_df_pred %>% filter(Experience == "ME" ) %>% select("ME" = tp)
  faults_le <- long_df_pred %>% filter(Experience == "LE" ) %>% select("LE" = tp)
  faults_et <- long_df_pred %>% filter(Technique  == "ET" ) %>% select("ET" = tp)
  faults_tct<- long_df_pred %>% filter(Technique  == "TCT") %>% select("TCT"= tp)
  
  #3. Merges each column into a single data frame.
  individual_levels <- bind_cols(faults_me, faults_le, faults_et, faults_tct)
  return(individual_levels)
}

# Utility function to plot our choice problem. 
# It's simply a wrapper around the function provided by the pt package.
plot_choice_problem <- function(choice_problem, chosen_labels, filename = NULL){
  drawChoices(choice_problem,
              decision_square_x=0.2, decision_square_edge_length=0.1,
              circle_radius=0.05   , y_split_gap=0.1, x_split_offset=0.03,
              probability_text_digits=3, y_probability_text_offset=0.015,
              y_value_text_offset=0.005, x_value_text_offset=0.025,
              probability_text_font_colour="black", probability_text_font_size=11,
              objective_consequence_text_font_colour="black",
              objective_consequence_text_font_size=11,
              label=chosen_labels,
              label_font_colour=c("black","black", "black"),
              label_font_size=c(11,11,18),
              label_positions=list(c(0.380,0.65),c(0.380,0.35),c(0.2,0.5)))
}

# Creates customised diverging plots from a likert object.
create_diverging_plot <- function(likert_results, answer_texts, title = ""){
  diverging_plot <- plot(likert_results) + ggtitle(title) + 
    scale_fill_viridis_d(drop = F, name = "Answers: ", labels = answer_texts) +
    PLOT_THEME + theme(legend.position = "top")
  return(diverging_plot)
}